Abstract

This paper describes the design of a low-power microprocessor
system that can run between 8 MHz at 1.1 V and 100 MHz
at 3.3 V. The ramifications of Dynamic
Voltage Scaling, which allows the processor to dynami- 
cally alter its operating voltage at run-time, will be pre- 
sented along with a description of the system design 
and an approach to benchmarking. In addition, a more 
in-depth discussion of the cache memory system will be 
given. 

1. Introduction 

Our design goal is the implementation of a low-power
microprocessor for embedded systems. It is estimated
that the processor will consume 1.8 mW at 1.1 V/8 MHz
and 220 mW at 3.3 V/100 MHz using a 0.6 µm
CMOS process. This paper discusses the system design,
cache optimization, and the processor's Dynamic Voltage
Scaling (DVS) ability.

In CMOS design, the energy-per-operation is
given by the equation

$$E_{op} \propto C V^2$$

where C is the switched capacitance and V is the operating
voltage [2]. To minimize E_op, we use aggressive
low-power design techniques to reduce C and DVS to
optimize V.

Our system design, which addresses the complete 
microprocessor system and not just the processor core, 
is presented in Section 2. Our benchmark suite, which is 
designed for a DVS embedded system, is presented in 
Section 3. Section 4 discusses the issues involved with 
the implementation of DVS, while Section 5 presents an 
in-depth discussion of our cache design. 

The basic goal of DVS is to quickly (~10 µs)
adjust the processor's operating voltage at run-time to
the minimum level that delivers the performance required
by the application. By continually adapting to the varying
performance demands of the application, energy efficiency
is maximized.

The main difference between our design and that
of the StrongARM is the power/performance target: our
system targets ultra-low power consumption with moderate
performance while the StrongARM targets moderate
power consumption with high performance. Our
processor core is based on the ARM8 architecture [1],
which is virtually identical to that of the StrongARM.
The similarities and differences between the two designs
are highlighted throughout this paper.

2. System Overview 

To effectively optimize system energy, it is neces- 
sary to consider all of the critical components: there is 
little benefit in optimizing the microprocessor core if 
other required elements dominate the energy consump- 
tion. For this reason, we have included the microproces- 
sor core, data cache, processor bus, and external SRAM 
in our design, as seen in Figure 1. The energy consumed
by the I/O system (not shown) is completely application 
and device dependent and is therefore beyond the scope 
of our work. The expected power distribution of our sys- 
tem is given in Figure 2. 

Figure 1: System Block Diagram

Figure 2: System Energy Breakdown

To reduce the energy consumption of the memory 
system, we use a highly optimized SRAM design [3] 
which is 32 data-bits wide, requiring only one device to be
activated for each access. Schemes that use multiple
narrow-width SRAMs require multiple devices to be
activated for each access, resulting in a significant
increase in the energy consumption. To alleviate the
high pin count problem of 32-bit memory devices, we 
multiplex the data address onto the same bit-lines as the 
data words. 

We use a custom designed high-efficiency non- 
linear switching voltage regulator [14] to generate 
dynamic supply voltages between 1.1V and 3.3V. An 
efficient regulator is crucial to an efficient system 
because all energy consumed is channeled through the 
regulator. When switching from 3.3V to 1.1V, a linear 
regulator would only realize a 3x energy savings, 
instead of the 12x reduction afforded by our design.

The threshold voltage (V_t) significantly affects
the energy and performance of a CMOS circuit. Our
design uses a V_t of 0.8 V to achieve a balance between
performance and energy consumption. The StrongARM
[13], for comparison, uses a V_t of 0.35 V, which
increases performance at the expense of increased static
power consumption. When idle, the StrongARM is
reported to consume 20 mW, which is the predicted
power consumption of our processor when running at
20 MHz. When idle, we estimate our processor will consume
200 µW, roughly two orders of magnitude less.

3. Benchmarks 

Our benchmark suite targets PDAs and embedded 
applications. Benchmark suites such as SPEC95 are not 
appropriate for our uses because they are batch-oriented
and target high-performance workstations. DVS evalua- 
tion requires the benchmarking of workload idle charac- 
teristics, which is not possible with batch-oriented 
benchmarks. Additionally, our target device has on the 
order of 1MB of memory and lacks much of the system 
support required by heavy-weight benchmarks; running 
SPEC95 on our target device would simply be impracti- 
cal. 

We feel the following six benchmarks are needed 
to adequately represent the range of workloads found in 
embedded systems: 

• AUDIO Decryption 

• MPEG Decoding 

• User Interfaces 

• Java Interpreter 

• Web Browser 

• Graphics Primitive Rendering 

As of this writing, we have implemented the first 
three of these and their characteristics are summarized 
in Table 3. "Idle Time" represents the portion of system 
idle time, used by DVS algorithms. The "Bus Activity" 
column reports the fraction of active cycles on the exter- 
nal processor bus, an important metric when optimizing 
the cache system. The cache architecture used to gener- 
ate Table 3 is discussed in Section 5. 

Benchmark   (unlabeled)   Idle Time   Bus Activity
AUDIO       0.23%         67%         0.35%
MPEG        1.7%          22%         14%
UI          0.62%         95%         0.52%

Table 3: Benchmark Characterization

As an example, Figure 4 shows an event impulse
graph [13], which is used to help characterize programs
for DVS analysis. Each impulse represents one MPEG
frame and indicates the amount of work necessary to
process that frame. For this example, there is a fixed
frame-rate which can be used to calculate the optimal
processor speed for each frame, assuming only one
outstanding frame at any given time.

Figure 4: MPEG Event Impulse Graph

4. Dynamic Voltage Scaling 

Our processor has the ability, termed Dynamic 
Voltage Scaling (DVS), to alter its execution voltage
while in operation. This ability allows the processor to 
operate at the optimal energy/efficiency point and real- 
ize significant energy savings, which can be as much as 
80% for some applications [13]. This section discusses 
DVS design considerations and explains how it affects 
architectural performance evaluations. 

DVS combines two equations of sub-micron 
CMOS design [2]: 

$$E_{op} \propto V^2 \quad \text{and} \quad f_{max} \propto \frac{V - V_t}{V}$$

where E_op is the energy-per-operation, f_max is the
maximum clock frequency, and V is the operating voltage.
To minimize the energy consumed by a given task,
we can reduce V, effecting a reduction in E_op. A
reduction in V, as shown in the second equation, results
in a corresponding decrease in f_max. A simple example
of these effects is given below.
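
As a rough numerical illustration of the two proportionalities above, the following sketch evaluates them relative to the 3.3 V operating point. The function names, the normalization to 3.3 V, and the use of a 0.8 V threshold (the value quoted in Section 2) are assumptions made for illustration. The relations are first-order, so the results are not expected to reproduce the measured operating points of the processor.

```python
# First-order DVS model: E_op is proportional to V^2 and
# f_max is proportional to (V - V_t) / V.
# V_REF and V_T are illustrative assumptions based on figures quoted in the text.
V_REF = 3.3  # reference supply voltage in volts
V_T = 0.8    # threshold voltage in volts


def relative_energy_per_op(v, v_ref=V_REF):
    """Energy per operation at voltage v, relative to v_ref."""
    return (v / v_ref) ** 2


def relative_f_max(v, v_ref=V_REF, v_t=V_T):
    """Maximum clock frequency at voltage v, relative to v_ref."""
    if v <= v_t:
        raise ValueError("supply voltage must exceed the threshold voltage")
    return ((v - v_t) / v) / ((v_ref - v_t) / v_ref)


if __name__ == "__main__":
    for v in (3.3, 2.2, 1.1):
        print(f"V={v:.1f} V  E_op={relative_energy_per_op(v):.3f}  "
              f"f_max={relative_f_max(v):.3f}")
```

Lowering V reduces both quantities, which is the trade-off DVS exploits: energy falls quadratically while the attainable clock rate falls more gradually until V approaches V_t.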

Reducing f_clk, the actual processor clock used,
without reducing V does not reduce the energy consumed
by a processor for a given task. The
StrongARM 1100, for example, allows f_clk to be
dynamically altered during operation [16], effecting a
linear reduction in the power consumed. However, the
change in f_clk also causes a linear increase in task
run-time, causing the energy-per-task to remain constant.
Our system always runs with f_clk = f_max, which
minimizes the energy consumed by a task.
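
The argument that frequency scaling alone saves no energy can be checked with a short calculation. The sketch below uses the standard switching-power relation P = C V^2 f; the capacitance and cycle count are hypothetical values chosen only for illustration.

```python
# Switching power P = C * V^2 * f; a task of N cycles takes N / f seconds.
# All numeric values below are illustrative assumptions, not measurements.

def task_energy(c_switched, v, f_clk, cycles):
    """Energy (J) to run `cycles` clock cycles at voltage v and clock f_clk."""
    power = c_switched * v ** 2 * f_clk   # watts
    runtime = cycles / f_clk              # seconds
    return power * runtime


if __name__ == "__main__":
    C = 1e-9            # hypothetical switched capacitance per cycle (F)
    CYCLES = 1_000_000  # hypothetical task length in cycles
    full = task_energy(C, 3.3, 100e6, CYCLES)
    half = task_energy(C, 3.3, 50e6, CYCLES)
    low_v = task_energy(C, 1.1, 8e6, CYCLES)
    # Halving f_clk halves power but doubles runtime, so full and half match.
    # Only the lower voltage reduces the energy of the task.
    print(full, half, low_v)
```

The clock frequency cancels out of the product of power and run-time, so the energy of a task depends on the voltage at which its cycles are executed.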

From a software perspective, we have abstracted 
away the voltage parameter and specify the operating 
point in terms of f_max. The actual voltage used is determined
by a feedback loop driven by a simple ring oscillator.
The primary reason for this design was ease of the
hardware implementation; fortunately, it also presents 
the most useful software interface. 

Our system applies one dynamic voltage to the 
entire system to realize savings from all components. It 
would be possible, however, to use multiple independent 
supply voltages to independently meet subsystem per- 
formance requirements. This was not attempted in our 
design. To interface with DVS-incompatible external 
components we use custom designed level-converting 
circuits. 

The implementation of DVS requires the applica- 
tion of voltage scheduling algorithms. These algorithms, 
discussed in Section 4.2, monitor the current and 
expected state of the system to determine the optimal 
operating voltage (frequency). 

4.1 Energy/Performance Evaluation Under DVS 

DVS can affect the way we analyze architectural 
trade-offs. As an example, we explore the interaction 
between DVS and the ARM Thumb [4] instruction set. 
We apply Thumb to the MPEG benchmark from 
Section 3 and analyze the energy consumed. This exam- 
ple assumes a 32-bit memory system, which is a valid 
assumption for high-performance systems but not nec- 
essarily for all embedded designs. 

The MPEG benchmark is 22% idle when running 
at 100 MHz using the 32-bit ARM instruction set. DVS 
allows us to minimize the operating voltage to fill 
unnecessary idle-time. Using a first-order approxima- 
tion, this would reduce the energy consumed by 40% 
and slow down the processor clock to the point at which 
idle time is zero. From this starting point, we consider 
the application of the Thumb instruction set to this 
benchmark. 
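
The figures in this paragraph can be reproduced with a short calculation if one assumes that voltage scales linearly with clock frequency, so that energy scales with the square of the clock. That assumption and the function names are ours, introduced only to make the first-order approximation concrete.

```python
def stretched_clock_mhz(f_mhz, idle_fraction):
    """Slowest clock that absorbs all idle time for the same workload."""
    return f_mhz * (1.0 - idle_fraction)


def first_order_energy_ratio(f_new_mhz, f_old_mhz):
    """Energy ratio assuming V scales linearly with f, so E scales as f^2."""
    return (f_new_mhz / f_old_mhz) ** 2


if __name__ == "__main__":
    f_base = stretched_clock_mhz(100.0, 0.22)   # MPEG is 22% idle at 100 MHz
    ratio = first_order_energy_ratio(f_base, 100.0)
    print(f"clock: {f_base:.0f} MHz, energy saved: {(1 - ratio) * 100:.0f}%")
```

Under this assumption the clock drops to 78 MHz and the energy falls by about 39%, in line with the roughly 40% reduction quoted above.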

For typical programs, the 16-bit Thumb instruction
set is 30% more dense than its 32-bit counterpart,
reducing the energy consumed in the cache and memory
hierarchy. However, due to reduced functionality, the
number of instructions executed increases by roughly
18%, increasing the energy dissipated in the processor
core as well as the task execution time.

This example illustrates two important lessons.
First, an increase in task delay directly relates to an
increase in energy: DVS exposes the trade-off between
energy and performance. Second, an increase in delay
affects the entire system (core and cache), not just one
fragment: it is vital that the associated increase in the
energy-per-operation is applied to the entire system.

Figure 5 presents six metrics crossed with three 
configurations running the MPEG benchmark. The three 
configurations are: 

• Base: 78 MHz using 32-bit instructions. 

• Thumb: 78 MHz using Thumb instructions. 

• Adjusted: 92 MHz using Thumb instructions. 




[Figure 5 metrics: Processor Utilization, Processor Speed, Energy Per Operation, Cache Energy, Core Energy, Total Energy]

Figure 5: DVS Example

The 'Base' configuration represents the MPEG
benchmark running as 32-bit code, as discussed above.
'Thumb' illustrates the intermediate effects of the 16-bit
Thumb architecture without increasing the clock speed.
The energy consumed in the cache (see Figure 5)
decreases due to the decreased memory bandwidth
caused by the smaller code size. The energy of the core,
however, rises slightly due to the increased number of
instructions processed. Overall, the energy decreases by
approximately 10%.

The delay increase caused by the expanded 
instruction stream pushes the processor utilization over 
100%. Because of this, the MPEG application will not 
be able to process its video frames fast enough. The 
'Adjusted' configuration represents the increase in pro- 
cessor speed required to maintain performance. This 
change in clock frequency necessitates an increase in 
voltage which raises the energy-per-operation of the 
entire system. As can be seen from the 'Total Energy'
columns, the energy savings are no longer realized: the 
16-bit architecture increases overall energy consump- 
tion. 
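
The step from the 'Thumb' to the 'Adjusted' configuration can be sketched numerically. The 18% instruction growth and the 78 MHz base clock come from the text; the helper names and the assumption that voltage scales linearly with frequency (so energy-per-operation scales as the square of the clock) are ours.

```python
# Hypothetical helper names; input figures are the ones quoted in the text.

def thumb_utilization(base_utilization, instruction_growth):
    """Utilization after the instruction count grows by instruction_growth."""
    return base_utilization * (1.0 + instruction_growth)


def adjusted_clock_mhz(f_mhz, util):
    """Clock needed to bring utilization back to at most 100%."""
    return f_mhz * util if util > 1.0 else f_mhz


def energy_per_op_ratio(f_new_mhz, f_old_mhz):
    """Assumes V scales linearly with f, so energy-per-operation scales as f^2."""
    return (f_new_mhz / f_old_mhz) ** 2


if __name__ == "__main__":
    util = thumb_utilization(1.0, 0.18)   # Base runs with zero idle time
    f_adj = adjusted_clock_mhz(78.0, util)
    print(f"utilization: {util:.0%}, adjusted clock: {f_adj:.0f} MHz, "
          f"energy-per-op x{energy_per_op_ratio(f_adj, 78.0):.2f}")
```

The adjusted clock of about 92 MHz matches the 'Adjusted' configuration. The higher energy-per-operation applies to every component of the system, which is why it can outweigh the cache savings from denser code.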

Although not energy-efficient in all situations, the 
Thumb instruction set may be efficient for some tasks 
due to the non-linearity of voltage-scaling. If the base 
system were initially running at a very low voltage, for 
example, the increase in processor speed necessary 
would not dramatically increase the energy-per-operation.
The savings due to the reduced code-size, therefore,
would effect an overall decrease in system energy.

4.2 Voltage Scheduling 

To effectively control DVS, a voltage scheduler is
used to dynamically adjust the processor speed and voltage
at run-time. Voltage scheduling significantly complicates
the scheduling task since it allows optimization
of the processor clock rate. Voltage schedulers analyze
the current and past state of the system in order to predict
the future workload of the processor.

Interval-based voltage schedulers are simple
techniques that periodically analyze system utilization
at a global level: no direct knowledge of individual
threads or programs is needed. If the preceding time
interval was greater than 50% active, for example, the
algorithm might increase the processor's speed and voltage
for the next time interval. References [5], [13], and [17]
analyze the effectiveness of this scheduling technique
across a variety of workloads. Interval-based scheduling has the
advantage of being easy to implement, but it often
mispredicts future workloads.
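
A minimal sketch of one interval-based policy follows. The 50% threshold is the example given in the text; the fixed step size, the symmetric slow-down when utilization is low, and all names are hypothetical choices made for illustration, not the policy of any cited scheduler.

```python
F_MIN_MHZ, F_MAX_MHZ = 8.0, 100.0   # operating range quoted in the abstract


def next_speed(current_mhz, active_fraction, threshold=0.5, step_mhz=10.0):
    """Pick the clock for the next interval from the last interval's utilization.

    threshold and step_mhz are hypothetical tuning parameters.
    """
    if active_fraction > threshold:
        target = current_mhz + step_mhz   # busy interval: speed up
    else:
        target = current_mhz - step_mhz   # mostly idle interval: slow down
    return min(F_MAX_MHZ, max(F_MIN_MHZ, target))


if __name__ == "__main__":
    speed = 50.0
    for active in (0.9, 0.8, 0.3, 0.1, 0.7):   # hypothetical utilization trace
        speed = next_speed(speed, active)
        print(f"active={active:.0%} -> next interval at {speed:.0f} MHz")
```

Because the decision uses only the previous interval, a sudden burst of work is served at the old, lower speed for one interval, which is the misprediction problem noted above.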

More recently, investigation has begun into 
thread-based voltage schedulers, which require knowl- 
edge of individual thread deadlines and computation 
required [7][12]. Given such information, thread-based 
schedulers can calculate the optimal speed and voltage 
setting, resulting in minimized energy consumption. A 
sample deadline-based voltage scheduling graph is 
given in Figure 6; S_x and D_x represent task start-time
and deadline, respectively, while the graph area, C_x,
represents computational resources required.

Figure 6: The Voltage Scheduling Graph
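
For a single task considered in isolation, the speed a thread-based scheduler would choose follows directly from the task's start time, deadline, and required computation. The sketch below illustrates that calculation only; the function names, the example task, and the clamping to the 8 MHz to 100 MHz range are assumptions, and it ignores interactions between overlapping tasks.

```python
F_MIN_MHZ, F_MAX_MHZ = 8.0, 100.0  # operating range quoted in the abstract


def required_speed_mhz(start_s, deadline_s, cycles):
    """Lowest constant clock (MHz) that finishes `cycles` between start and deadline."""
    window = deadline_s - start_s
    if window <= 0:
        raise ValueError("deadline must be later than start time")
    return cycles / window / 1e6


def clamp_speed_mhz(f_mhz):
    """Limit a requested speed to the processor's supported range."""
    return min(F_MAX_MHZ, max(F_MIN_MHZ, f_mhz))


if __name__ == "__main__":
    # Hypothetical task: 2 million cycles due 40 ms after it becomes runnable.
    f = required_speed_mhz(0.0, 0.040, 2_000_000)
    print(f"run at {clamp_speed_mhz(f):.1f} MHz")
```

If the required speed exceeds the maximum clock, clamping means the deadline cannot be met at any voltage; a real scheduler would have to report or handle that case.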

4.3 Circuit Level Considerations

At the circuit level, there are two types of compo- 
nents in our design adversely affected by DVS: complex 
logic gates and memory sense-amps. Complex logic 
gates, such as 8-input NAND gates, are implemented by 
a CMOS transistor chain which will have a different rel- 
ative delay if the voltage is varied. Additionally, mem- 
ory sense-amps are sensitive to voltage variations 
because of their analog nature, which is necessary to 
detect the small voltage fluctuations of the memory 
cells. 

To the largest extent possible, these voltage sensi- 
tive circuits are avoided; however, in some situations, 
such as in the cache CAM design described below, it is 
better to redesign the required components with 
increased tolerance. Redesigns of these components will 
often be less efficient or slower than the original version 
when running at a fixed voltage. We estimate an
increase in the average energy/instruction of the microprocessor
on the order of 10%, which is justified by the
overall savings afforded by DVS.

5. Cache Design 

This section describes the design of our cache 
system, which is a 16kB unified 32-way set-associative 
read-allocate write-back cache with a 32-byte line size. 
The cache is an important component to optimize since 
it consumes roughly 33% of the system power and is 
central to system performance. Our primary design goal 
was to optimize for low-power while maintaining per- 
formance; our cache analysis is based on layout capaci- 
tance estimates and aggregated benchmark statistics. 

Our 16kB cache is divided into 16 individual 1kB
blocks. The 1kB block-size was chosen to achieve a balance
between block access energy and global routing
energy. Increasing the block-size would decrease the
capacitance of the global routing but it would also
increase the energy-per-access of the individual blocks.

Our cache geometry is very similar to that of the 
StrongARM, which has a split 16kB/16kB instruction/ 
data cache. Other features, namely the 32-way associative
CAM array, are similar. In the StrongARM design,
the caches consume approximately 47% of the system
power [13].

5.1 Basic Cache Structure 

We have discovered that a CAM based cache 
design (our implementation is given in Figure 7) is more 
efficient than a traditional set-associative organization 
(Figure 8) in terms of both power and performance. The 
fundamental drawback with the traditional design is that 
the energy per access scales linearly with the associativ- 
ity: multiple tags and data must be fetched simulta- 
neously to maintain cycle time. A direct-mapped cache, 
therefore, would be extremely energy efficient; its per- 
formance, however, would be unacceptable. We esti- 
mate that the energy of our 32-way set-associative 
design is comparable to that of a 2-way set-associative 
traditional design. 

Figure 7: Implemented CAM Cache Design

Figure 8: Traditional Set-Associative Cache Design

Our design has been modified from a vanilla 
CAM in two major ways: 

• Narrow memory bank: The fundamental SRAM
data block for our design is organized as a 2-word x
128-row block, instead of an 8-word x 32-row block.

• Inhibited tag checks: Back-to-back accesses map- 
ping to the same cache line do not trigger multiple 
tag checks. 

The 2-word by 128-row block organization for 
our cache data was chosen primarily because a large 
block width would increase the energy-per-access to the 
data block. A block width of 8 words, for example, 
would effectively entail fetching 8 words per access, 
which is wasteful since only one or two of these words 
would be used. The narrow block width unfortunately 
causes an irregular physical layout, increasing total 
cache area; however, we chose this design as energy was 
our primary concern. 

There are two natural lower-bounds on the block 
width. First, the physical implementation of the SRAM 
block has an inherent minimum width of 2-words [3]. 
Second, the ARM8 architecture has the capability for
double-bandwidth instruction fetches and data reads [1], 
which lends itself to a 2-word per access implementa- 
tion. 

Unnecessary tag checks, which would waste 
energy, are inhibited for temporally sequential accesses 
that map to the same cache line. Using the sequential- 
access signal provided by the processor core and a small 
number of access bits, this condition can be detected 
without a full address comparison. Our simulations indicate
that about 46% of the tag checks are avoided with an
8-word cache line size, aggregated across both instruction
and data accesses. For the individual instruction and
data streams, 61% and 8% of tag checks are prevented,
respectively.
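
The effect of inhibiting tag checks can be estimated from an address trace. The sketch below counts how many checks are skipped when an access falls in the same cache line as the one immediately before it. It models the outcome with a full line-number comparison, which is a simplification: the hardware described above detects the condition from the sequential-access signal and a few address bits. The names and the example trace are hypothetical.

```python
LINE_BYTES = 32  # 8-word line with 4-byte words


def count_tag_checks(addresses, line_bytes=LINE_BYTES):
    """Return (performed, avoided) tag checks for a trace of byte addresses.

    A check is skipped when an access maps to the same cache line as the
    previous access.
    """
    performed = 0
    avoided = 0
    last_line = None
    for addr in addresses:
        line = addr // line_bytes
        if line == last_line:
            avoided += 1
        else:
            performed += 1
            last_line = line
    return performed, avoided


if __name__ == "__main__":
    # Hypothetical trace: 32 sequential instruction fetches, one word each.
    trace = range(0, 128, 4)
    performed, avoided = count_tag_checks(trace)
    print(f"performed={performed} avoided={avoided}")
```

A purely sequential stream needs one tag check per line, whereas a stream that alternates between lines avoids none, which is consistent with the instruction stream benefiting far more than the data stream.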

5.2 Cache Policies and Geometry 

Cache energy has a roughly logarithmic relationship
with respect to its overall size, due to selective
block enabling: a 16kB cache consumes little more 
energy than an 8kB cache. Our fundamental cache size 
constraint was die cost, which is determined primarily 
by cache area. Benchmark simulations indicate that a 
16kB unified cache is sufficient; we felt the increased 
cost of a 32kB cache was not justified. We chose a uni- 
fied cache because it is most compatible with the ARM8 
architecture. 

The cache line size has a wide-ranging impact on 
energy efficiency; our analysis (Figure 9) indicates that 
an 8-word line size is optimal for our workload. Given
the 1kB block size, our associativity is inversely proportional
to the line size: an 8-word line yields 32-way
associativity (1kB / 8-words = 32-way). The energy of a
CAM tag access is roughly linear with associativity. 
Also, smaller cache line sizes generate less external bus 
traffic, consuming less energy. The energy of the data 
memory is practically constant, although there are slight 
variations caused by updates due to cache misses. 

Figure 9: Line-Size Energy Breakdown
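
The relationship between line size and associativity for a fixed block size can be written out directly. The sketch below assumes 4-byte words, which follows from the 32-byte, 8-word line described in the text; the constant and function names are ours.

```python
WORD_BYTES = 4            # 32-byte line / 8 words
BLOCK_BYTES = 1024        # one 1kB cache block
CACHE_BYTES = 16 * 1024   # total cache size


def associativity(line_words, block_bytes=BLOCK_BYTES, word_bytes=WORD_BYTES):
    """Number of lines in one block, i.e. the CAM associativity."""
    line_bytes = line_words * word_bytes
    if block_bytes % line_bytes != 0:
        raise ValueError("line size must divide the block size evenly")
    return block_bytes // line_bytes


if __name__ == "__main__":
    print(f"blocks in cache: {CACHE_BYTES // BLOCK_BYTES}")
    for words in (4, 8, 16):
        print(f"{words}-word line -> {associativity(words)}-way")
```

Halving the line size doubles the number of CAM tags searched per access, which is why tag energy grows as the line shrinks.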

We implement a write-back cache to minimize 
external bus traffic. Our simulations indicate that a 
write-through cache would increase the external bus 
traffic by approximately 4x, increasing the energy of the 
entire system by 27%. We found no observable perfor- 
mance difference between the two policies. 

Our simulations find no significant evidence 
either for or against read-allocate in terms of energy or 
performance; we implement read-allocate to simplify 
the internal implementation. Similarly, we find that 
round-robin replacement performance is comparable to 
that of both LRU and random replacement, due to the 
large associativity. 

5.3 Related Work 

Most low-power cache literature [6][9][15][8]
suggests improvements to the standard set-associative 
cache model of Figure 8. The architectural improve- 
ments proposed center around the concepts of sub-bank- 
ing and row-buffering. Sub-banking retrieves only the 
required portion of a cache line, saving energy by not 
extraneously fetching data. Row-buffering fetches and 
saves an entire cache line to avoid future unnecessary 
tag comparisons.

Our CAM-based cache design indirectly imple- 
ments the concepts of sub-banking and row-buffering. 
The 2-word block size of our memory bank is similar to
2-word sub-banking. Tag-check inhibition is similar to
row-buffering: only one tag-check is required for each 
cache-line access. 

[8] presents a technique for reducing the energy 
of CAM-based TLBs by restricting the effective asso- 
ciativity of the parallel tag compare and modifying the 
internal CAM block. Due to time constraints, these 
modifications were not considered for our design. 

6. Conclusion 

This paper describes the implementation of a 
low-power Dynamic Voltage Scaling (DVS) micropro- 
cessor. Our analysis encompasses the entire micropro- 
cessor system, including the memory hierarchy and 
processor core. We use a custom benchmark suite 
appropriate for our target application: a portable embed- 
ded system. 

Dynamic Voltage Scaling allows our processor to 
operate at maximal efficiency without limiting peak per- 
formance. Understanding the fluid relationship between 
energy and performance is crucial when making architectural
design decisions. A new class of algorithms,
termed voltage schedulers, is required to effectively
control DVS.

A description of our cache design was given
which presents the architectural and circuit trade-offs
with energy and performance for our application
domain. We found that a CAM-based cache design
consumes less energy
than a traditional set-associative configuration.